Abstract


We  generalize  the  notion  of  PAC  learning  from  an  example  oracle  to  a  notion  of  efficient 
learning  on  a  quantum  computer  using  a  quantum  example  oracle.  This  quantum  example 
oracle  is  a  natural  extension  of  the  traditional  PAC  example  oracle,  and  it  immediately  follows 
that  all  PAC-learnable  function  classes  are  learnable  in  the  quantum  model.  Furthermore,  we 
obtain  positive  quantum  learning  results  for  classes  that  are  not  known  to  be  PAC  learnable. 
Specifically,  we  show  that  DNF  is  efficiently  learnable  with  respect  to  the  uniform  distribution 
by  a  quantum  algorithm  using  a  quantum  example  oracle.  While  it  was  already  known  that 
DNF is uniform-learnable using a membership oracle, the quantum example oracle is provably
less  powerful  than  a  membership  oracle.  We  also  generalize  the  notion  of  classification  noise 
to  the  quantum  setting  and  show  that  the  quantum  DNF  algorithm  learns  even  in  the 
presence  of  such  noise.  This  result  contrasts  with  a  recent  negative  result  for  DNF  in  the 
statistical query model of learning from noisy data.
1 Introduction


Recently, there has been increasing interest in the question of whether or not quantum physical
effects can be used to solve problems that appear to be computationally difficult using
traditional methods. In this paper, we apply quantum methods to questions in computational
learning theory. In particular, we focus on the problem of learning, from examples
alone, the class DNF of polynomial-size Disjunctive Normal Form expressions.

The  DNF  learning  problem  has  a  long  history.  Valiant  [18]  introduced  the  problem 
and  gave  efficient  algorithms  for  learning  certain  subclasses  of  DNF.  Since  then,  learning 
algorithms have been developed for a number of other subclasses of DNF [13, 4, 3, 11, 2,
1,  7,  16,  9]  and  recently  for  the  unrestricted  class  of  DNF  expressions  [6,  12],  but  almost 
all  of  these  results — and  in  particular  the  results  for  the  unrestricted  class — use  membership 
queries  (the  learner  is  told  the  output  value  of  the  target  function  on  learner-specified  inputs). 
This  has  left  open  the  question  of  to  what  extent  membership  queries  are  necessary  for  DNF 
learning,  even  in  models  where  the  learner  is  only  required  to  produce  a  hypothesis  that 
weakly  approximates  the  target  DNF  expression  with  respect  to  the  uniform  distribution 
(definitions  are  given  in  the  next  section). 

We  show  that  DNF  is  efficiently  learnable  with  respect  to  the  uniform  distribution  by  a 
quantum  algorithm  that  receives  its  information  about  the  target  function  from  a  quantum 
example  oracle.  This  oracle  generalizes  the  traditional  PAC  example  oracle  in  a  natural  way. 
Specifically, the quantum oracle $QEX(f, D)$ is a traditional PAC example oracle $EX(f, D)$
except that $QEX(f, D)$ produces the example $(x, f(x))$ with amplitude $\sqrt{D(x)}$ rather than
with probability $D(x)$. We show that, with respect to the uniform distribution, a quantum
example  oracle  can  be  simulated  by  a  membership  oracle  but  not  vice  versa.  Thus  our  result 
can  be  viewed  as  evidence  that  DNF  is  learnable  without  the  full  power  of  membership 
queries. 

Furthermore,  we  generalize  the  notion  of  classification  noise  and  show  that  our  algorithm 
learns DNF even if the quantum example oracle exhibits such noise. This is particularly
interesting  in  light  of  recent  results  of  Blum  et  al.  [6]  showing  that  DNF  is  not  learnable  in 
the  statistical  query  (SQ)  model.  Because  SQ  learning  is  conjectured  to  be  equivalent  to  the 
model  of  PAC  learning  with  classification  noise,  our  result  is  evidence  that  quantum  learning 
algorithms  may  be  better  able  to  handle  noise  than  traditional  algorithms. 

To  obtain  our  quantum  DNF  learning  algorithm,  we  modify  the  recent  Harmonic  Sieve 
algorithm  (HS)  for  learning  DNF  with  respect  to  uniform  using  membership  queries  [12]. 
In fact, HS properly learns the larger class $PT_1$ of functions expressible as a threshold of a
polynomial  number  of  parity  functions,  and  our  algorithm  properly  learns  this  class  as  well. 
The  Harmonic  Sieve  uses  membership  queries  to  locate  parity  functions  that  correlate  well 
with  the  target  function  with  respect  to  various  near-uniform  distributions.  The  heart  of 
our  result  is  showing  that  these  parities  can  be  located  efficiently  by  a  quantum  algorithm 
using  only  a  quantum  example  oracle. 
2 Definitions and Notation

2.1  Functions  and  Function  Classes 

We  will  be  interested  in  the  learnability  of  sets  (classes)  of  Boolean  functions.  The  Boolean 
functions we consider are, unless otherwise noted, of the type $f : \{0,1\}^n \to \{0,1\}$ for fixed
positive values of $n$. We call $\{0,1\}^n$ the instance space of $f$, an element $x$ in the instance
space an instance, and the pair $(x, f(x))$ an example of $f$. We denote by $x_i$ the $i$th bit of
instance $x$.

Intuitively,  a  learning  algorithm  should  be  allowed  to  run  in  time  polynomial  in  the 
complexity of the function $f$ to be learned; we will use the size of a function as a measure
of its complexity. The size measure will depend on the function class to be learned. In
particular, each function class $\mathcal{F}$ that we study implicitly defines a natural class $\mathcal{R}_{\mathcal{F}}$ of
representations of the functions in $\mathcal{F}$. We define the size of a function $f \in \mathcal{F}$ as the
minimum, over all $r \in \mathcal{R}_{\mathcal{F}}$ such that $r$ represents $f$, of the size of $r$, and we define below the
size measure for each representation class of interest.

A  DNF  expression  is  a  disjunction  of  terms,  where  each  term  is  a  conjunction  of  literals 
and  a  literal  is  either  a  variable  or  its  negation.  The  size  of  a  DNF  expression  r  is  the  number 
of  terms  in  r.  The  DNF  function  class  is  the  set  of  all  functions  that  can  be  represented  as 
a  DNF  expression  of  size  polynomial  in  n. 

Following Bruck [8], we use $PT_1$ to denote the class of functions on $\{0,1\}^n$ expressible
as a depth-2 circuit with a majority gate at the root and polynomially many parity gates at
the leaves. All gates have unbounded fanin and fanout one. The size of a $PT_1$ circuit $r$ is
the number of parity gates in $r$.
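
To make the two representation classes concrete, the following Python sketch evaluates a DNF expression and a $PT_1$ circuit on a single instance. The encodings used here (a term as a list of `(index, negated)` pairs, a parity gate as a tuple of variable indices, a $\pm 1$ output for the majority gate, and the tie-breaking rule) are assumptions made for illustration and are not notation from this paper.

```python
from typing import List, Sequence, Tuple

# A literal is (variable index, negated?); a term is a list of literals.
Term = List[Tuple[int, bool]]


def eval_dnf(terms: List[Term], x: Sequence[int]) -> int:
    """Return 1 if at least one term (conjunction of literals) is satisfied by x."""
    for term in terms:
        if all((x[i] == 0) if negated else (x[i] == 1) for i, negated in term):
            return 1
    return 0


def dnf_size(terms: List[Term]) -> int:
    """The size of a DNF expression is its number of terms."""
    return len(terms)


def eval_pt1(parities: List[Tuple[int, ...]], x: Sequence[int]) -> int:
    """Majority vote over parity gates, each read as +1 (even) or -1 (odd).

    Ties are broken toward +1, an arbitrary choice for this sketch.
    """
    votes = sum(1 - 2 * (sum(x[i] for i in idx) % 2) for idx in parities)
    return 1 if votes >= 0 else -1


# (x0 AND x1) OR (NOT x2 AND x3), a DNF of size 2.
example_dnf = [[(0, False), (1, False)], [(2, True), (3, False)]]
```

The size of a $PT_1$ circuit in this encoding is simply `len(parities)`.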


2.2  Quantum  Turing  Machines 

We now review the model of quantum computation defined by Bernstein and Vazirani [5].
First  we  define  how  the  specification  (program)  of  a  quantum  Turing  machine  (QTM)  is 
written  down.  Then  we  describe  how  a  QTM  operates. 

The specification of a QTM is exactly the same as the specification of a probabilistic
TM (PTM), except that the transition probabilities between PTM configurations are replaced in a
QTM specification with complex-valued numbers (amplitudes) that satisfy a certain
well-formedness property. We define well-formedness as follows. For a QTM $M$, let $R_M$ be the
(infinite-dimensional) matrix where each row and each column is labeled with a machine
configuration ($c_r$ and $c_c$, respectively) and each entry in $R_M$ is the amplitude assigned by $M$
to the transition from configuration $c_c$ to $c_r$. Then $M$ satisfies the well-formedness property
if $R_M$ is unitary ($R_M R_M^\dagger = R_M^\dagger R_M = I$, where $R_M^\dagger$ is the conjugate transpose of $R_M$). A
QTM specification also contains a set of states (including all of the final states) in which an
Obs operation is performed; we define this operation below.

To  describe  the  operation  of  a  QTM,  we  use  the  notion  of  a  superposition  of  configurations. 
For example, consider a probabilistic Turing machine $M'$ that at step $i$ flips a fair coin and
chooses to transition to one of two configurations $c_1$ and $c_2$. While we would generally think
of $M'$ as being in exactly one of these configurations at step $i + 1$, we can equivalently think
of $M'$ as being in both states, each with probability 1/2. Continuing in this fashion, for
each step until $M'$ terminates we can think of $M'$ as being in a superposition of states,
each state with an associated probability. After $M'$ takes its final step, each of its final
states will have some associated probability (we assume without loss of generality that all
computation paths in $M'$ have the same length). If $M'$ now "chooses" to be in one of these
final states $q_f$ randomly according to the induced probability distribution on final states,
then the probability of being in $q_f$ is exactly the same in this model as it is in the traditional
PTM model.

In  summary,  we  can  view  a  PTM  M'  as  being  in  a  superposition  of  configurations  at 
each  step,  where  a  superposition  is  represented  by  a  vector  of  probabilities,  one  for  each 
possible  configuration  of  M'.  Likewise,  we  view  a  QTM  M  as  being  in  a  superposition  of 
configurations  at  each  step,  but  now  the  superposition  vector  contains  an  amplitude  for  each 
possible  configuration  of  M.  The  initial  superposition  vector  in  both  cases  is  the  all-zero 
vector  except  for  a  single  1  in  the  position  corresponding  to  the  initial  configuration  of  the 
machine.  Note  that  each  step  of  a  PTM  M'  can  be  accomplished  by  multiplying  the  current 
superposition vector by a matrix $R_{M'}$ which is defined analogously with $R_M$ above. In the
same way, each step of a QTM $M$ is accomplished by multiplying its current superposition
vector by $R_M$. The difference between the machines comes at the point(s) where $M$ "chooses"
to  be  in  a  single  configuration  rather  than  in  a  superposition  of  configurations.  M  does  this 
(conceptually)  by  transitioning  to  a  superposition  of  configurations  all  of  which  are  in  one  of 
the  Obs  states  mentioned  above.  The  superposition  vector  is  then  changed  so  that  a  single 
configuration  has  amplitude  1  and  all  others  are  0.  This  is  exactly  analogous  to  the  PTM  M' 
choosing its final state, except that the probability of choosing each configuration $c_i$ is now
the square of the magnitude of the amplitude associated with $c_i$ in $M$'s current superposition
vector. (We formalize the definition of Obs below.)

We adopt Simon's notation [17] and write

$$\sum_x a_x |x\rangle$$

to denote a superposition of configurations $x$ each having amplitude $a_x$. While in general
this sum is over all possible configurations of the QTM, when we use this notation it will
be the case that all of the configurations having nonzero amplitude are in the same state
and have the tape head at the same position. In this case, the configurations $x$ are only
distinguished by their tape contents, so we will treat $x$ as if it is merely the tape content
and ignore other configuration parameters.

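Given this notation, we formally define the Obs operation as follows. Let $b \in \{0, 1\}$.
Then

$$\mathrm{Obs}\Big(\sum_{b,x} a_{bx}\,|bx\rangle\Big) =
\begin{cases}
\dfrac{1}{\sqrt{\sum_x |a_{0x}|^2}} \sum_x a_{0x}\,|0x\rangle & \text{with probability } \sum_x |a_{0x}|^2, \\[2ex]
\dfrac{1}{\sqrt{\sum_x |a_{1x}|^2}} \sum_x a_{1x}\,|1x\rangle & \text{with probability } \sum_x |a_{1x}|^2.
\end{cases}$$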


Note  that  by  permuting  bits  of  the  tape  and  performing  successive  Obs  operations  we  can 
simulate  the  informal  definition  of  Obs  given  earlier.  We  say  that  a  language  L  is  in  BQP 
if  there  exists  a  QTM  M  such  that,  at  the  end  of  a  polynomial  number  of  steps  by  M,  an 
Obs  fixes  the  first  tape  cell  to  1  with  probability  at  least  2/3  if  the  input  is  in  L  and  fixes 
it  to  0  with  probability  at  least  2/3  otherwise.  We  will  also  sometimes  think  of  an  Obs  as 
simply computing the probability that the first cell will be fixed to 1.
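
The Obs operation can be mimicked classically on small examples. The sketch below is an illustration only: it stores a superposition as a Python dictionary from tape contents (bit strings) to amplitudes, which is an assumed encoding, and it observes the first tape cell by sampling a bit with probability equal to the summed squared magnitudes and then renormalizing the surviving amplitudes.

```python
import math
import random
from typing import Dict, Tuple

# Assumed encoding: tape contents (bit string) -> amplitude.
Superposition = Dict[str, complex]


def prob_first_cell(state: Superposition, b: str) -> float:
    """Probability that observing the first tape cell yields the bit b."""
    return sum(abs(a) ** 2 for x, a in state.items() if x[0] == b)


def obs(state: Superposition, rng: random.Random) -> Tuple[str, Superposition]:
    """Observe the first cell and return (observed bit, collapsed superposition)."""
    p1 = prob_first_cell(state, "1")
    b = "1" if rng.random() < p1 else "0"
    p = prob_first_cell(state, b)
    scale = 1.0 / math.sqrt(p)  # renormalize the surviving configurations
    return b, {x: a * scale for x, a in state.items() if x[0] == b}


# Amplitudes 1/2, 1/2 and -1/sqrt(2): the first cell is 1 with probability 1/2.
example = {"00": 0.5, "01": 0.5, "11": -1 / math.sqrt(2)}
```

Calling `prob_first_cell(state, "1")` alone corresponds to the second reading of Obs mentioned above, in which Obs simply computes the probability that the first cell is fixed to 1.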
2.3 Learning Models

We begin by defining the well-known PAC learning model and then generalize this to a
quantum model of learning. First, we define several supporting concepts. Given a function
$f$ and probability distribution $D$ on the instance space of $f$, we say that function $h$ is an
$\epsilon$-approximator for $f$ with respect to $D$ if $\Pr_D[h = f] \ge 1 - \epsilon$. An example oracle for $f$
with respect to $D$ ($EX(f, D)$) is an oracle that on request draws an instance $x$ at random
according to probability distribution $D$ and returns the example $(x, f(x))$. A membership
oracle for $f$ ($MEM(f)$) is an oracle that given any instance $x$ returns the value $f(x)$. Let
$\mathcal{D}_n$ denote a nonempty set of probability distributions on $\{0,1\}^n$. Any set $\mathcal{D} = \bigcup_n \mathcal{D}_n$ is
called a distribution class. We let $U_n$ represent the uniform distribution on $\{0,1\}^n$ and call
$\mathcal{U} = \bigcup_n U_n$ simply the uniform distribution.

Now  we  formally  define  the  Probably  Approximately  Correct  (PAC)  model  of  learnability 
[18]. Let $\epsilon$ and $\delta$ be positive values (called the accuracy and confidence of the learning
procedure, respectively). Then we say that the function class $\mathcal{F}$ is (strongly) PAC learnable
if there is an algorithm $A$ such that for any $\epsilon$ and $\delta$, any $f \in \mathcal{F}$ (the target function), and
any distribution $D$ on the instance space of $f$ (the target distribution), with probability at
least $1 - \delta$ algorithm $A(EX(f, D), \epsilon, \delta)$ produces an $\epsilon$-approximator for $f$ with respect to
$D$ in time polynomial in $n$, the size of $f$, $1/\epsilon$, and $1/\delta$. The probability that $A$ succeeds is
taken  over  the  random  choices  made  by  EX  and  A  (if  any).  We  generally  drop  the  “PAC” 
from  “PAC  learnable”  when  the  model  of  learning  is  clear  from  context. 

Next, we consider learning using a QTM. First, note that each call to the traditional PAC
example oracle $EX(f, D)$ can be viewed as defining a superposition of $2^n$ configurations, each
containing a distinct $(x, f(x))$ pair and having probability of occurrence $D(x)$. We generalize
this to the quantum setting in a natural way. A quantum example oracle for $f$ with respect
to $D$ ($QEX(f, D)$) is an oracle running coherently with a QTM $M$ that changes $M$'s tape
$|y\rangle$ to

$$\sum_x \sqrt{D(x)}\,|y, x, f(x)\rangle.$$

That is, $QEX$ defines a superposition of $2^n$ configurations much as $EX$ does, but $QEX$
assigns each configuration an amplitude $\sqrt{D(x)}$. Note that a call to $QEX$ followed by an
Obs operation is equivalent to a call to $EX$. We say that $\mathcal{F}$ is quantum learnable if $\mathcal{F}$ is
PAC learnable by a QTM $M$ using a quantum example oracle. Because every efficient TM
computation can be simulated efficiently by a QTM [5] and because $EX$ can be simulated
by $QEX$, we have that every PAC-learnable function class is also quantum learnable.
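
The relationship between $QEX$ and $EX$ can be checked numerically for small $n$. The following sketch is illustrative only: it builds the amplitude table produced by a quantum example oracle on a blank tape and confirms that observing the whole tape yields each example $(x, f(x))$ with probability $D(x)$. The dictionary encoding, the helper names, and the choice of a 3-bit majority function as target are assumptions for this example.

```python
import itertools
import math
from typing import Callable, Dict, Tuple

Bits = Tuple[int, ...]


def qex_state(f: Callable[[Bits], int],
              D: Dict[Bits, float]) -> Dict[Tuple[Bits, int], float]:
    """Amplitudes of QEX(f, D) on a blank tape: sqrt(D(x)) on each |x, f(x)>."""
    return {(x, f(x)): math.sqrt(p) for x, p in D.items() if p > 0}


def observe_distribution(state: Dict[Tuple[Bits, int], float]) -> Dict[Tuple[Bits, int], float]:
    """Probability of each configuration when the entire tape is observed."""
    return {cfg: abs(a) ** 2 for cfg, a in state.items()}


n = 3
cube = list(itertools.product((0, 1), repeat=n))
uniform = {x: 1 / 2 ** n for x in cube}


def majority(x: Bits) -> int:
    """Hypothetical target function used only for this illustration."""
    return int(sum(x) >= 2)


state = qex_state(majority, uniform)
probs = observe_distribution(state)  # matches the distribution sampled by EX
```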

We will consider several variations on the basic PAC and quantum models. Let $\mathcal{M}$ be
any model of learning (e.g., PAC). If $\mathcal{F}$ is $\mathcal{M}$-learnable by an algorithm $A$ that requires a
membership oracle then $\mathcal{F}$ is $\mathcal{M}$-learnable using membership queries. If $\mathcal{F}$ is $\mathcal{M}$-learnable
for $\epsilon = 1/2 - 1/p(n, s)$, where $p$ is a fixed polynomial and $s$ is the size of $f$, then $\mathcal{F}$ is weakly
$\mathcal{M}$-learnable. We say that $\mathcal{F}$ is $\mathcal{M}$-learnable by $\mathcal{H}$ if $\mathcal{F}$ is $\mathcal{M}$-learnable by an algorithm $A$
that always outputs a function $h \in \mathcal{H}$. If $\mathcal{F}$ is $\mathcal{M}$-learnable by $\mathcal{F}$ then we say that $\mathcal{F}$ is
properly $\mathcal{M}$-learnable. Finally, note that the PAC model places no restriction on the example
distribution $D$; we sometimes refer to such learning models as distribution-independent. If
$\mathcal{F}$ is $\mathcal{M}$-learnable for all distributions $D$ in distribution class $\mathcal{D}$ then $\mathcal{F}$ is $\mathcal{M}$-learnable with
respect to $\mathcal{D}$.
2.4 The Fourier Transform

For each bit vector $a \in \{0,1\}^n$ we define the function $\chi_a : \{0,1\}^n \to \{-1,+1\}$ as

$$\chi_a(x) = (-1)^{\sum_{i=1}^n a_i x_i} = 1 - 2\Big(\sum_{i=1}^n a_i x_i \bmod 2\Big).$$

That is, $\chi_a(x)$ is the Boolean function that is 1 when the parity of the bits in $x$ indexed by
$a$ is even and is $-1$ otherwise. With inner product defined by* $\langle f, g\rangle = E[fg]$ and norm
defined by $\|f\| = \sqrt{E[f^2]}$, $\{\chi_a \mid a \in \{0,1\}^n\}$ is an orthonormal basis for the vector space of
real-valued functions on the Boolean cube $Z_2^n$. That is, every function $f : \{0,1\}^n \to R$ can
be uniquely expressed as a linear combination of parity functions:

$$f = \sum_a \hat{f}(a)\chi_a,$$

where $\hat{f}(a) = E[f\chi_a]$. We call the vector of coefficients $\hat{f}$ the Fourier transform of $f$. Note
that for Boolean $f$, $\hat{f}(a)$ represents the correlation of $f$ and $\chi_a$ with respect to the uniform
distribution. Also note that $\hat{f}(\mathbf{0}) = E[f\chi_{\mathbf{0}}] = E[f]$, since $\chi_{\mathbf{0}}$ is the constant function $+1$.

Parseval's identity states that for every function $f$, $E[f^2] = \sum_a \hat{f}^2(a)$. For Boolean $f$
taking values in $\{-1,+1\}$ it follows that $\sum_a \hat{f}^2(a) = 1$. More generally, it can be shown that
for any functions $f$ and $g$, $E[fg] = \sum_a \hat{f}(a)\hat{g}(a)$.
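
For small $n$ the Fourier transform can be computed by brute force, which gives a quick way to check Parseval's identity on an example. The sketch below is illustrative: the function names and the example target (a $\pm 1$-valued function of two of three bits) are assumptions, not part of the paper.

```python
import itertools
from typing import Callable, Dict, Tuple

Bits = Tuple[int, ...]


def chi(a: Bits, x: Bits) -> int:
    """Parity function chi_a(x) = (-1)^(sum_i a_i x_i)."""
    return 1 - 2 * (sum(ai * xi for ai, xi in zip(a, x)) % 2)


def fourier_transform(f: Callable[[Bits], float], n: int) -> Dict[Bits, float]:
    """Return the coefficients E[f chi_a] under the uniform distribution."""
    cube = list(itertools.product((0, 1), repeat=n))
    return {a: sum(f(x) * chi(a, x) for x in cube) / len(cube) for a in cube}


def f(x: Bits) -> int:
    """Example target: -1 exactly when x0 AND x1 holds, +1 otherwise."""
    return -1 if (x[0] and x[1]) else 1


coeffs = fourier_transform(f, 3)
parseval_sum = sum(c * c for c in coeffs.values())  # equals E[f^2] = 1 here
```

The enumeration takes time exponential in $n$, so this is only a checking tool; the point of the learning algorithms in this paper is to locate large coefficients without such an enumeration.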

3 DNF Learning

In  this  section  we  present  our  primary  result,  that  DNF  is  quantum  learnable  with  respect  to 
the  uniform  distribution.  Our  result  builds  on  the  HS  algorithm  for  learning  DNF  with  respect 
to  the  uniform  distribution  using  membership  queries  [12].  The  HS  algorithm  depends  on  a 
key fact about DNF expressions: for every DNF $f$ with $s$ terms and for every distribution
$D$ there exists a parity $\chi_a$ such that $|E_D[f\chi_a]| \ge 1/(2s+1)$ [12]. It follows immediately
from this fact that for every DNF $f$ and distribution $D$ there is a parity $\chi_a$ that is a weak
approximator to $f$ with respect to $D$. Furthermore, for many probability distributions
(e.g., distributions such that $D(x) \le p(1/\epsilon)/2^n$ for all $x$ and for $p$ a fixed polynomial), an
algorithm of Kushilevitz and Mansour [15] can be used to efficiently find such a $\chi_a$. The
Kushilevitz-Mansour  algorithm  is  the  only  aspect  of  HS  that  requires  membership  queries. 
Finally,  an  algorithm  of  Freund  [10]  is  employed  to  boost  the  Kushilevitz-Mansour  weak 
learning  algorithm  into  a  strong  learner. 
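
The key fact can be observed directly on a small example by exhaustive search over all parities. The sketch below is not the Kushilevitz-Mansour algorithm; it is a brute-force stand-in that takes time exponential in $n$. The example DNF, the convention that "true" maps to $+1$, and the two test distributions are assumptions chosen for illustration.

```python
import itertools
from typing import Callable, Dict, Tuple

Bits = Tuple[int, ...]


def chi(a: Bits, x: Bits) -> int:
    """Parity function chi_a(x) = (-1)^(sum_i a_i x_i)."""
    return 1 - 2 * (sum(ai * xi for ai, xi in zip(a, x)) % 2)


def best_parity(f: Callable[[Bits], int], D: Dict[Bits, float],
                n: int) -> Tuple[Bits, float]:
    """Exhaustively find the parity chi_a maximizing |E_D[f chi_a]|."""
    cube = list(itertools.product((0, 1), repeat=n))
    scored = [(sum(D[x] * f(x) * chi(a, x) for x in cube), a) for a in cube]
    corr, a = max(scored, key=lambda t: abs(t[0]))
    return a, corr


def weak_hypothesis(a: Bits, corr: float) -> Callable[[Bits], int]:
    """Return h(x) = sign(corr) * chi_a(x), as in the last step of WDNF."""
    sign = 1 if corr >= 0 else -1
    return lambda x: sign * chi(a, x)


def dnf(x: Bits) -> int:
    """(x0 AND x1) OR (NOT x2 AND x3), with s = 2 terms; true -> +1, false -> -1."""
    return 1 if (x[0] and x[1]) or ((not x[2]) and x[3]) else -1


cube = list(itertools.product((0, 1), repeat=4))
uniform = {x: 1 / 16 for x in cube}
skewed = {x: (1 + x[0]) / 24 for x in cube}  # a non-uniform example distribution
```

For both distributions the best correlation is at least $1/(2s+1) = 1/5$, so the resulting hypothesis agrees with the target with probability at least $1/2 + 1/(4s+2)$.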

The HS algorithm and its primary subroutine WDNF are sketched in Figures 1 and 2 ($c_1$
and $c_2$ represent fixed constants). The HS algorithm runs for at most $k$ steps, or stages.
$r_i(x)$ represents the number of weak hypotheses $w_j$ among those hypotheses produced before
stage $i$ that are "right" on $x$. At each stage $i$, the boosting algorithm implicitly defines
a distribution $D_i(x)$ based on $r_i(x)$; this is the distribution for which the weak learner is
expected to produce a weak hypothesis. We can generate $x$'s according to this distribution

*  Expectations  and  probabilities  here  and  elsewhere  are  with  respect  to  the  uniform  distribution  over  the 
instance  space  unless  otherwise  indicated. 
Invocation: $h \leftarrow \mathrm{HS}(n, s, MEM(f), \epsilon, \delta)$

Input: $n$; $s$ = size of DNF $f$; $MEM(f)$; $\epsilon > 0$; $\delta > 0$

Output: with probability at least $1 - \delta$ (over random choices made by HS), HS returns $h$ such
that $\Pr[f = h] \ge 1 - \epsilon$

1. $k \leftarrow c_1 s^2 \log(1/\epsilon)$
2. $B(j; n, p) \equiv \binom{n}{j} p^j (1-p)^{n-j}$
3. $\beta^i_r \equiv B(\lfloor k/2 \rfloor - r;\ k - i - 1,\ 1/2 + 1/(4s+2))$ if $i - k/2 < r \le k/2$, $\beta^i_r \equiv 0$ otherwise
4. $\alpha^i_r \equiv \beta^i_r / \max_r \beta^i_r$
5. $w_0 \leftarrow \mathrm{WDNF}(n, s, MEM(f), U_n, \delta/2k)$
6. for $i \leftarrow 1, \ldots, k-1$ do
7.     $r_i(x) \equiv |\{0 \le j < i \mid w_j(x) = f(x)\}|$
8.     accept $\leftarrow \mathrm{Est}(E_x[\alpha^i_{r_i(x)}], c_2\epsilon^3/3, \delta/2k)$
9.     if accept $\le 2c_2\epsilon^3/3$ then
10.         $k \leftarrow i$
11.         break do
12.     endif
13.     $\tilde{D}_i(x) \equiv \alpha^i_{r_i(x)} / (2^n \cdot \text{accept})$
14.     $w_i \leftarrow \mathrm{WDNF}(n, s, MEM(f), \tilde{D}_i(x), \delta/2k)$
15. enddo
16. return $h(x) \equiv \mathrm{MAJ}(w_0(x), w_1(x), \ldots, w_{k-1}(x))$


Figure  1:  Harmonic  Sieve  algorithm  for  learning  DNF. 
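
The weights computed in lines 2 to 4 of HS can be tabulated directly. The sketch below is illustrative only; in particular, taking the maximum over $r = 0, \ldots, i$ (every value that $r_i(x)$ can assume at stage $i$) is an assumption of this example, as are the function names.

```python
import math
from typing import List


def binom_pmf(j: int, n: int, p: float) -> float:
    """B(j; n, p) = C(n, j) p^j (1 - p)^(n - j)."""
    return math.comb(n, j) * p ** j * (1 - p) ** (n - j)


def beta(i: int, r: int, k: int, s: int) -> float:
    """Unscaled weight for an instance on which r of the first i hypotheses are right."""
    if i - k / 2 < r <= k / 2:
        return binom_pmf(k // 2 - r, k - i - 1, 0.5 + 1 / (4 * s + 2))
    return 0.0


def alpha_weights(i: int, k: int, s: int) -> List[float]:
    """Weights for r = 0..i at stage i, scaled so that the largest equals 1.

    Assumption: the scaling maximum ranges over r = 0..i.
    """
    betas = [beta(i, r, k, s) for r in range(i + 1)]
    top = max(betas)
    return [b / top for b in betas]
```

Because every weight lies in $[0, 1]$, each one can be used directly as the heads probability of a coin flip.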


as follows. Choose an instance $x$ uniformly at random and flip a coin that comes up heads
with probability $\alpha^i_{r_i(x)}$ ($\alpha$ is a scaled binomial distribution). If the coin comes up heads,
output $x$. Otherwise, select a new $x$ uniformly at random and flip the coin again. Repeat
this process until some $x$ is output.

Thus

$$D_i(x) = \frac{\alpha^i_{r_i(x)}}{2^n\, E_y[\alpha^i_{r_i(y)}]}. \qquad (1)$$
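
The sampling procedure just described is ordinary rejection sampling. The following sketch is a minimal illustration, assuming the weights are supplied as a list `alpha` indexed by $r$ and that `r_i` is a callable returning the count of correct weak hypotheses on $x$; both names are hypothetical. It also computes the induced distribution exactly for small $n$ so the two can be compared.

```python
import itertools
import random
from typing import Callable, Dict, Sequence, Tuple

Bits = Tuple[int, ...]


def sample_from_Di(n: int, alpha: Sequence[float], r_i: Callable[[Bits], int],
                   rng: random.Random) -> Bits:
    """Draw x uniformly and accept it with probability alpha[r_i(x)]; repeat until accepted."""
    while True:
        x = tuple(rng.randint(0, 1) for _ in range(n))
        if rng.random() < alpha[r_i(x)]:
            return x


def exact_Di(n: int, alpha: Sequence[float],
             r_i: Callable[[Bits], int]) -> Dict[Bits, float]:
    """The distribution induced by the procedure: alpha[r_i(x)] normalized over the cube."""
    cube = list(itertools.product((0, 1), repeat=n))
    total = sum(alpha[r_i(x)] for x in cube)
    return {x: alpha[r_i(x)] / total for x in cube}
```

The expected number of draws per accepted sample is the reciprocal of the average weight $E_x[\alpha^i_{r_i(x)}]$, which is why HS stops once its estimate of this average becomes too small.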


In order for the Kushilevitz-Mansour algorithm to find a good weak approximator with
respect to distribution $D_i$, we need to be able to closely approximate $D_i(x)$ for every value
of $x$. The function $\tilde{D}_i$ is HS's estimate of $D_i$. Note that because of the bound on the
variable accept and the accuracy with which accept estimates $E_x[\alpha^i_{r_i(x)}]$, with high probability
$E_x[\alpha^i_{r_i(x)}] = c_3 \cdot \text{accept}$ for fixed $c_3 \in [1/2, 3/2]$, and thus $\tilde{D}_i(x) = c_3 D_i(x)$ for all $x$.

We have omitted the details of line 2 of WDNF because this is the main point at which our
quantum algorithm will differ from HS. Rather than using membership queries to locate the
required parity $\chi_a$, the new algorithm will use a quantum example oracle. Both algorithms
depend on a key fact about DNF expressions: for every DNF $f$ with $s$ terms and for every
distribution $D$ there exists a parity $\chi_a$ such that $|E_D[f\chi_a]| \ge 1/(2s+1)$ [12]. We will show
how to find such a parity $\chi_a$ for any distribution $D_i$ simulated by HS given access only to a
quantum example oracle.
Invocation: $w_i \leftarrow \mathrm{WDNF}(n, s, MEM(f), cD, \delta)$

Input: $n$; $s$ = size of DNF $f$; $MEM(f)$; $cD$, an oracle that given $x$ returns $c \cdot D(x)$, where
$c$ is a constant in $[1/2, 3/2]$ and $D$ is a probability distribution on $\{0,1\}^n$; $\delta > 0$
Output: with probability at least $1 - \delta$ (over random choices made by WDNF), WDNF returns
$h$ such that $\Pr_D[f = h] \ge 1/2 + 1/(4s+2)$

1. $g(x) \equiv c\,2^n f(x) D(x)$

2. find (using membership queries and with probability at least $1 - \delta$) $\chi_a$ such that
$|E_D[g\chi_a]| \ge 1/(2s+1)$

3. return $h(x) \equiv \mathrm{sign}(E_D[g\chi_a]) \cdot \chi_a(x)$


Figure  2:  WDNF  subroutine  called  by  HS. 

Specifically, consider the call to WDNF at line 14 of HS for a fixed $i$, and for notational
convenience let $D = D_i$ and $\alpha(x) = \alpha^i_{r_i(x)}$. Then with high probability there is a
$c_3 \in [1/2, 3/2]$ such that for all $\chi_a$,

$$\sum_x f(x)\chi_a(x)\tilde{D}_i(x) = c_3 E_D[f\chi_a] = E[\alpha f\chi_a]/\text{accept},$$

and thus for some $\chi_a$,

$$|E[\alpha f\chi_a]| \ge \frac{c_2\epsilon^3}{3(2s+1)}.$$

Our goal will be to find such a $\chi_a$ using only a quantum example oracle for $f$. From this
$\chi_a$ we can produce a weak approximator for $f$, as illustrated by WDNF. Conceptually, to
find such a $\chi_a$ we will run a quantum program that will sample the $\chi_a$'s with probability
proportional to $E^2[\alpha f\chi_a]$. The technique we use to perform this sampling is similar to an
algorithm of Bernstein and Vazirani that samples the $\chi_a$'s with probability $\hat{f}^2(a) = E^2[f\chi_a]$.
However, there are two difficulties with using their technique directly. First, their algorithm
uses calls to the function $f$ (membership queries), and we want an algorithm that uses only
quantum example queries. Second, their technique works for Boolean functions, but $\alpha \cdot f$ is
not Boolean. The following lemma addresses the first difficulty.

Lemma 1 There is a quantum program QSAMP that, given any quantum example oracle
$QEX(f)$, returns $\chi_a$ with probability $\hat{f}^2(a)/2$.

Proof: QSAMP begins by calling $QEX(f)$ on a blank tape to get the superposition

$$\frac{1}{2^{n/2}} \sum_x |x, f(x)\rangle.$$

QSAMP next replaces $f(x)$ with $(1 - f(x))/2$ (call this $f'(x)$); note that $(-1)^{f'(x)} = f(x)$.
Then we will apply a Fourier operator $F$ to the entire tape contents. We define $F$ as

$$F(|a\rangle) = \frac{1}{2^{n/2}} \sum_y (-1)^{a \cdot y}\,|y\rangle,$$

where $|a| = n$. This operation can be performed in $n$ steps by a quantum Turing machine
[5]. Also recall that $(-1)^{a \cdot y} = \chi_a(y) = \chi_y(a)$. Thus applying $F$ gives us

$$\frac{1}{2^{n+1/2}} \sum_{x,y,z} (-1)^{x \cdot y + f'(x) z}\,|y, z\rangle
= \frac{1}{\sqrt{2}} \sum_y \hat{f}(y)\,|y, 1\rangle + \frac{1}{\sqrt{2}}\,|\mathbf{0}, 0\rangle,$$

where $|y| = |x| = n$ and $|z| = 1$ and the final line follows by orthonormality of the parity
basis. An Obs operation at this point produces $|y, 1\rangle$ with probability $\hat{f}^2(y)/2$, as desired.
□ 
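
Lemma 1 can be checked numerically for small $n$ by simulating the state vector classically. The sketch below is an illustration under stated assumptions: instances are encoded as integers in $[0, 2^n)$, the label bit is stored as the low-order bit of the index, the target is $\pm 1$-valued, and the Fourier operator is implemented as a fast Walsh-Hadamard transform on all $n + 1$ tape bits. The simulation takes time exponential in $n$ and is not a quantum algorithm.

```python
import math
from typing import Callable, Dict, List, Tuple


def walsh_hadamard(vec: List[float]) -> List[float]:
    """Apply F to a state vector: |a> -> 2^(-m/2) sum_b (-1)^(a.b) |b>."""
    out = list(vec)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for j in range(start, start + h):
                u, v = out[j], out[j + h]
                out[j], out[j + h] = u + v, u - v
        h *= 2
    norm = math.sqrt(len(out))
    return [a / norm for a in out]


def qsamp_distribution(f: Callable[[int], int], n: int) -> Dict[Tuple[int, int], float]:
    """Probability of observing each |y, z> after QSAMP's Fourier step."""
    size = 2 ** n
    state = [0.0] * (2 * size)
    for x in range(size):
        label = (1 - f(x)) // 2              # f'(x) in {0, 1}; (-1)^f'(x) = f(x)
        state[(x << 1) | label] = 1 / math.sqrt(size)
    final = walsh_hadamard(state)
    return {(idx >> 1, idx & 1): amp ** 2 for idx, amp in enumerate(final)}


def fourier_coefficient(f: Callable[[int], int], n: int, y: int) -> float:
    """Brute-force E[f chi_y] under the uniform distribution."""
    total = sum(f(x) * (-1) ** bin(x & y).count("1") for x in range(2 ** n))
    return total / 2 ** n


def maj3(x: int) -> int:
    """Hypothetical target: +1 if at least two of the three bits of x are set."""
    return 1 if bin(x).count("1") >= 2 else -1
```

For this target, `qsamp_distribution(maj3, 3)` assigns probability $\hat{f}^2(y)/2$ to each $|y, 1\rangle$ and probability $1/2$ to the uninformative outcome $|\mathbf{0}, 0\rangle$, matching the lemma.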

However, we want to sample the parities according to the coefficients of the non-Boolean
function $\alpha f$. We will do this indirectly by sampling over individual bits of the function.
First, note that we can limit the accuracy of $\alpha$ and still compute an adequate approximation
to $E[\alpha f\chi_a]$. That is, since $0 \le \alpha(x) \le 1$ for all $x$, for some $d = O(\log(s/\epsilon))$ and
$\theta(x) = \lfloor 2^d \alpha(x) \rfloor 2^{-d}$ we have

$$|E[\theta f\chi_a]| \ge |E[\alpha f\chi_a]| - 2^{-d}$$

for all $\chi_a$. Furthermore, every such $\theta \le 1$ can be written as

$$\theta = \theta_1 2^{-1} + \theta_2 2^{-2} + \cdots + \theta_d 2^{-d} + k 2^{-d},$$

where $\theta_j \in \{-1, +1\}$ and $k \in \{-1, 0, 1\}$. Thus

$$|E[\theta f\chi_a]| \le \max_j |E[\theta_j f\chi_a]| + 2^{-d}.$$

Therefore, if $\chi_a$ is a good approximator to $f$ then there is some $j$ such that $|E[\theta_j f\chi_a]|$ is
larger than $1/p_1(s, 1/\epsilon)$ for a fixed polynomial $p_1$. Furthermore, the number of such $\chi_a$'s for
each $j$ is bounded by $p_1^2(s, 1/\epsilon)$ by Parseval's identity.
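
The signed-digit expansion used above can be computed explicitly. The sketch below is one possible construction, not necessarily the one the authors had in mind: it represents a value as an integer multiple $m$ of $2^{-d}$, removes a correction $k \in \{-1, 0, 1\}$ to make it odd, and then reads the $\pm 1$ digits off an ordinary binary expansion. The function names are hypothetical.

```python
from typing import List, Tuple


def signed_digits(m: int, d: int) -> Tuple[List[int], int]:
    """Write m / 2^d (integer m with |m| <= 2^d, d >= 1) as
    sum_{j=1..d} theta_j 2^-j + k 2^-d with theta_j in {-1, +1}, k in {-1, 0, 1}."""
    if d < 1 or abs(m) > 2 ** d:
        raise ValueError("need d >= 1 and |m| <= 2^d")
    # The +/-1 digits always sum to an odd multiple of 2^-d, so k fixes the parity.
    k = 0 if m % 2 else (1 if m > 0 else -1)
    odd = m - k
    # With b_j = (theta_j + 1) / 2, sum_j theta_j 2^(d-j) = 2u - (2^d - 1).
    u = (odd + 2 ** d - 1) // 2
    bits = [(u >> (d - j)) & 1 for j in range(1, d + 1)]
    return [2 * b - 1 for b in bits], k


def recombine(thetas: List[int], k: int, d: int) -> int:
    """Return the integer m such that the expansion equals m / 2^d."""
    return sum(t * 2 ** (d - j) for j, t in enumerate(thetas, start=1)) + k
```

Each digit function $\theta_j(x)$ obtained this way is $\pm 1$-valued, which is what allows the Boolean sampling technique of Lemma 1 to be applied to $\theta_j f$.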

Thus a quantum algorithm for learning DNF with respect to uniform is obtained by
modifying HS as follows. First, procedure Est (line 8 of HS) will now use $QEX(f)$ to simulate
the example oracle $EX(f)$ in order to estimate expected values. Second, in order to find a
good weak approximator (line 2 of WDNF) we will use the quantum approach outlined above.
That is, for each value of $j$ we will sample the $\chi_a$'s using a probability equal to $E^2[\theta_j f\chi_a]/2$.
We do this by running a modified QSAMP that, after calling $QEX(f)$, replaces $f(x)$ with
$\theta_j(x) \cdot f(x)$. This is a reversible operation because $x$ is still on the tape and $\theta_j^2(x) = 1$ for all
$x$. If we perform this sampling $2d\log(1/\delta)$ times for each value of $j$ then with probability
at least $1 - \delta$ one of the $\chi_a$'s returned will be a good weak approximator. Finally, we will
simulate $EX(f)$ in order to test whether or not a given $\chi_a$ is a good weak approximator.
This gives us
Theorem 2 DNF is quantum learnable with respect to uniform.

Also, $PT_1$, the class of functions expressible as a threshold of a polynomial number of
parity functions, has the property that every $PT_1$ function can be weakly approximated by a parity
with respect to any fixed distribution $D$ [12]. This was the only property of DNF that we
used in the above arguments. Therefore

Theorem 3 $PT_1$ is quantum learnable with respect to uniform.

4  Membership  Oracle  vs.  Quantum  Example  Oracle 

In  practice,  it  is  not  clear  how  a  quantum  example  oracle  could  be  constructed  without 
using  a  membership  oracle.  Furthermore,  because  a  QTM  uses  interference  over  an  entire 
superposition  to  perform  its  computations,  it  might  seem  that  perhaps  there  is  some  way 
to  simulate  a  membership  oracle  given  only  a  quantum  example  oracle  by  choosing  a  clever 
interference  pattern.  In  this  section  we  show  that  this  is  not  the  case. 

Definition 4 We say that membership queries can be quantum-example simulated for function
class $\mathcal{F}$ if there exists a BQP algorithm $A$ and a distribution $D$ such that for all $f \in \mathcal{F}$
and all $x$, running $A$ on input $x$ with quantum example oracle $QEX(f, D)$ produces $f(x)$.

Theorem  5  Membership  queries  cannot  be  quantum-example  simulated  for  DNF. 

Before proving this theorem, we develop some intuition. Consider two functions $f_0$ and
$f_1$ that differ in exactly one input $x$. Then the superpositions returned by $QEX(f_0, D)$ and
$QEX(f_1, D)$ are very similar for "almost all" $D$. In particular, if we think of superpositions
as vectors in an inner product space of dimension $2^n$, then there is in general an exponentially
small angle between the superpositions generated by these two oracles. This angle will not
be changed by unitary transformations. So in general, an observation will be unable to
detect a difference between the superpositions produced by $QEX(f_0, D)$ and $QEX(f_1, D)$.
Therefore a BQP algorithm with only a quantum example oracle $QEX(f_i, D)$, $i \in \{0, 1\}$,
will be unable to correctly answer a membership query on $x$ for both $f_0$ and $f_1$.
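
The size of this angle is easy to compute for the uniform distribution. The sketch below is illustrative only: it encodes instances as integers, stores the label in the low-order bit of the index, and compares the oracle state for the constant-zero function with the state for a function that is 1 on a single point. These encodings and names are assumptions of the example.

```python
import math
from typing import Callable, List


def qex_vector(f: Callable[[int], int], n: int) -> List[float]:
    """State vector of QEX(f, U_n): amplitude 2^(-n/2) on each index 2x + f(x)."""
    size = 2 ** n
    vec = [0.0] * (2 * size)
    for x in range(size):
        vec[2 * x + f(x)] = 1 / math.sqrt(size)
    return vec


def overlap(f0: Callable[[int], int], f1: Callable[[int], int], n: int) -> float:
    """Inner product of the two oracle states; unitary maps leave it unchanged."""
    return sum(a * b for a, b in zip(qex_vector(f0, n), qex_vector(f1, n)))


def angle(f0: Callable[[int], int], f1: Callable[[int], int], n: int) -> float:
    """Angle in radians between the two oracle states."""
    return math.acos(min(1.0, overlap(f0, f1, n)))


zero = lambda x: 0                  # the constant-zero function
point = lambda x: int(x == 5)       # differs from zero on the single input 5
```

Under the uniform distribution the overlap is $1 - 2^{-n}$, so the angle shrinks exponentially as $n$ grows; for example, `overlap(zero, point, 4)` is $15/16$.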

We  now  present  two  lemmas  that  will  help  us  to  formalize  this  intuition. 

Lemma 6 Let $A$ be a quantum algorithm that makes at most $t$ calls to $QEX(f, D)$. Then
there  is  an  equivalent  quantum  program  (modulo  a  polynomial  slowdown)  that  makes  all  t 
calls  at  the  beginning  of  the  program. 

Proof: Let $H$ be a unitary matrix representing an arbitrary move $\mu$ by the QTM $M$. Then
if $M$ is initially in a superposition $\sum_y a_y|y\rangle$, performs the move $\mu$, and then calls $QEX(f, D)$,
the resulting superposition will be

$$\sum_{x,y,z} \sqrt{D(x)}\, a_y H[z, y]\,|z, x\rangle,$$

where $|y| \le |z| \le |y| + 1$ (we are assuming without loss of generality a standard bit-string
encoding of configurations). But notice that there is a machine $M'$ that first calls $QEX$,
shifts $x$ one cell to the right (if necessary), and then simulates the move $\mu$. $M'$ is at most
polynomially slower than $M$ and produces the same tape configuration given above. A simple
inductive argument completes the proof. □

Before  presenting  the  next  lemma,  we  need  several  definitions. 

Definition 7 Define Obs over any linear combination of configurations (i.e., we no longer
require that the sum of squared amplitudes be 1) as

$$\mathrm{Obs}\Big(\sum_x a_x|x\rangle\Big) = \sum_{x : x_1 = 1} |a_x|^2.$$

Define the length of a linear combination of configurations $S = \sum_x a_x|x\rangle$ to be
$\|S\| = \sum_x |a_x|^2$. For any linear combination of configurations $S$ we define for $i \in \{0, 1\}$

$$S^{(i)} = \sum_{x : x_1 = i} a_x|x\rangle.$$

Lemma 8 Let $S_1$ and $S_2$ be superpositions and let $S$ be any linear combination of configurations.
Let $W$ be any quantum operation. Then

1. $\mathrm{Obs}(S) \le \|S\|$.

2. $\|WS\| = \|S\|$.

3. $|\mathrm{Obs}(S_1) - \mathrm{Obs}(S_2)| \le \mathrm{Obs}(S_1 - S_2)$.

Proof: For 1. we have

$$\mathrm{Obs}(S) = \|S^{(1)}\| \le \|S^{(0)}\| + \|S^{(1)}\| = \|S\|.$$

2. follows from the fact that $W$ is a unitary operation that preserves length.

To prove 3. we have

$$|\mathrm{Obs}(S_1) - \mathrm{Obs}(S_2)| = \Big|\, \|S_1^{(1)}\| - \|S_2^{(1)}\| \,\Big| \le \|S_1^{(1)} - S_2^{(1)}\| = \mathrm{Obs}(S_1 - S_2).$$

□

Proof of Theorem 5: By Lemma 6, we can assume that all of the calls to QEX
occur at the beginning of the program. Now suppose $M$ with $QEX_{f,D}$ can quantum-simulate
$MEM_f$. Take $f(x) \equiv 0$ and $g(x) = x_1^{c_1} \wedge \cdots \wedge x_n^{c_n}$, where $x_i^{c_i} = 1$ if and only if $x_i = c_i$. The
second function is zero at all points except $c = (c_1, \ldots, c_n)$. We want to use the simulator
to find $h(c)$. The simulator will first make $t$ calls to $QEX_{h,D}$ for $h \in \{f, g\}$, giving

$$S_h = \sum_{z_1, \ldots, z_t} \sqrt{D(z_1) \cdots D(z_t)}\; |z_1, h(z_1), \ldots, z_t, h(z_t)\rangle.$$

After that, the computation for $f$ and $g$ is the same. The superpositions for $h = f$ and $h = g$
differ only in configurations

$$|z_1, h(z_1), \ldots, z_t, h(z_t)\rangle$$

where one of the $z_i$ is $c$.

Therefore

$$S_f = S_g + E_{f,g},$$

where

$$E_{f,g} = \sum_{(\exists i)\, z_i = c} \sqrt{D(z_1) \cdots D(z_t)}\; |z_1, f(z_1), \ldots, z_t, f(z_t)\rangle \;-\; \sum_{(\exists i)\, z_i = c} \sqrt{D(z_1) \cdots D(z_t)}\; |z_1, g(z_1), \ldots, z_t, g(z_t)\rangle.$$

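If $W$ is the computation after the oracle calls then we observe $\mathrm{Obs}(WS_f)$ and $\mathrm{Obs}(WS_g)$ for
$f$ and $g$, respectively. By Lemma 8

$$\begin{aligned}
|\mathrm{Obs}(WS_f) - \mathrm{Obs}(WS_g)| &\le \mathrm{Obs}(WE_{f,g}) \\
&\le \|WE_{f,g}\| \\
&= 2 \sum_{(\exists i)\, z_i = c} \Big(\sqrt{D(z_1) \cdots D(z_t)}\Big)^2 \\
&= 2\,(1 - (1 - D(c))^t) \\
&\le 2tD(c).
\end{aligned}$$

For almost all points $c$, $D(c) < 1/\mathrm{poly}(n)$ for any poly(n). For all such $c$ the observations
for $f$ and $g$ are indistinguishable. □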
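The last two steps of this bound are easy to check numerically. The sketch below is illustrative only: the four-point distribution is made up, and the function names are not from the paper. It enumerates all $t$-tuples containing $c$, compares the result with the closed form $2(1 - (1 - D(c))^t)$, and compares that with $2tD(c)$.

```python
from itertools import product

def exact_length(dist, c, t):
    """2 * sum of D(z_1)...D(z_t) over all t-tuples in which some z_i equals c."""
    total = 0.0
    for zs in product(dist, repeat=t):
        if c in zs:
            p = 1.0
            for z in zs:
                p *= dist[z]
            total += p
    return 2.0 * total

def closed_form(d_c, t):
    """2 * Pr[some z_i = c] = 2 * (1 - (1 - D(c))^t)."""
    return 2.0 * (1.0 - (1.0 - d_c) ** t)

def linear_bound(d_c, t):
    """The final upper bound 2 * t * D(c)."""
    return 2.0 * t * d_c

# A small hypothetical distribution over 2-bit points.
dist = {"00": 0.4, "01": 0.3, "10": 0.2, "11": 0.1}
print(exact_length(dist, "11", 3), closed_form(0.1, 3), linear_bound(0.1, 3))

# Uniform distribution on n = 20 bits with t = 1000 oracle calls.
print(closed_form(2.0 ** -20, 1000), linear_bound(2.0 ** -20, 1000))
```

Under the uniform distribution $D(c) = 2^{-n}$, so the gap between the two observations stays exponentially small for any polynomial number of calls $t$.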

Note  that  while  there  are  restrictions  on  the  unitary  matrix  representing  the  transitions 
of a quantum Turing machine, we did not rely on these restrictions in the proof of this theorem.
Thus we have actually proved the stronger result that, even given the ability to apply
arbitrary  unitary  operations  to  superpositions,  it  is  not  possible  to  simulate  membership 
queries  in  polynomial  time  given  only  a  quantum  example  oracle.  On  the  other  hand,  it  is  a 
simple  matter  to  simulate  a  uniform  quantum  example  oracle  with  membership  queries. 

Lemma 9 For every Boolean function $f$, $QEX(f, U)$ can be simulated by $MEM(f)$.

Proof: $QEX(f, U)$ can be simulated by applying the Fourier transform $F$ to the tape $|0\rangle$
and applying $f$. □
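The two steps of this proof can be sketched with a small classical state-vector simulation. This is an illustration under stated assumptions, not the paper's construction: inputs are encoded as integers in $[0, 2^n)$, the Fourier transform over $\{0,1\}^n$ is implemented as a normalized Walsh-Hadamard transform, and the membership query is modeled as the map $|x, 0\rangle \to |x, f(x)\rangle$. The function names and the sample target are hypothetical.

```python
import math

def walsh_hadamard(amps):
    """Normalized (unitary) Walsh-Hadamard transform of a length-2^n vector."""
    amps = list(amps)
    size = len(amps)
    h = 1
    while h < size:
        for start in range(0, size, 2 * h):
            for j in range(start, start + h):
                a, b = amps[j], amps[j + h]
                amps[j] = (a + b) / math.sqrt(2)
                amps[j + h] = (a - b) / math.sqrt(2)
        h *= 2
    return amps

def simulate_uniform_qex(f, n):
    """Build the state sum_x 2^{-n/2} |x, f(x)> from |0^n> and one query to f."""
    start = [0.0] * (2 ** n)
    start[0] = 1.0                      # the tape |0...0>
    amps = walsh_hadamard(start)        # uniform superposition over all x
    # One membership query made in superposition: |x, 0> -> |x, f(x)>.
    return {(x, f(x)): amps[x] for x in range(2 ** n)}

def sample_target(x):
    """A hypothetical target: the AND of the two lowest-order bits."""
    return 1 if (x & 0b11) == 0b11 else 0

state = simulate_uniform_qex(sample_target, 3)
for (x, label), amp in sorted(state.items()):
    print(format(x, "03b"), label, round(amp, 4))
```

Every example $x$ appears with amplitude $2^{-n/2}$ and carries the correct label, which is the form of a uniform quantum example.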

Thus a membership oracle for $f$ is strictly more powerful than a uniform quantum example
oracle for $f$.

5  Learning  Noisy  DNF’s 

Recently,  Blum  et  al.  [6]  have  shown  that  DNF  is  not  learnable  with  respect  to  uniform  in 
the  statistical  query  (SQ)  model  of  learning.  The  SQ  model  is  apparently  a  good  model  of 


ObsiWEf,g) 

WWEjJ 

2  Yi  (\/D{z,)--D(z,) 

(3<)z<=c  ^ 

2(l-(l-£>(c))‘) 

2tD{c). 


11 


learning  with  classification  noise:  if  a  function  class  is  SQ  learnable  then  it  is  learnable  with 
classification  noise,  and  every  function  class  that  is  known  to  be  learnable  from  such  noise 
is  also  known  to  be  SQ  learnable  [14].  Therefore,  the  fact  that  DNF  is  hard  to  learn  in  the 
SQ  model  would  seem  to  be  strong  evidence  that  DNF  is  not  PAC  learnable  from  a  noisy 
example  oracle. 

However,  in  this  section  we  show  that  DNF  is  quantum  learnable  with  respect  to  uniform 
using a noisy quantum example oracle. We define such an oracle as follows: given a noise rate
$\eta$, a noisy quantum example oracle for $f$, $QEX^{\eta}(f, D)$, is exactly like the quantum example
oracle $QEX(f, D)$ except that each of the example labels is reversed with probability $\eta$.
Each call to $QEX^{\eta}(f, D)$ chooses which labels to flip independently of all previous calls.
We say that a function class is learnable using a noisy quantum example oracle $QEX^{\eta}(f, D)$
if the class is learnable in time polynomial in $1/(1 - 2\eta)$ as well as the standard parameters
(we assume for simplicity that $\eta$ is known). The probability of success is taken
over the random noise choices made by $QEX^{\eta}$ as well as any random choices made by the
learning  algorithm. 

We  now  state  and  prove  the  main  result  of  this  section. 

Theorem  10  DNF  is  quantum  learnable  with  respect  to  uniform  using  a  noisy  quantum 
example  oracle. 

Proof: Let $E_\eta[f\chi_a]$ represent the expected value of $f\chi_a$ with respect to the uniform
distribution on $x$ given that $f(x)$ is produced by a noisy oracle (quantum or otherwise) having
noise rate $\eta$. Using a technique due to Kearns [14], it can be shown that

$$E_\eta[f\chi_a] = (1 - 2\eta)\, E[f\chi_a].$$

Now consider running the quantum DNF algorithm developed earlier using a noisy quantum
example oracle. The component of this algorithm impacted by the noisy oracle is the
sampling of $\chi_a$'s to locate a weak approximator for $f$. Recall that when we used a noiseless
oracle we sampled each $\chi_a$ with probability proportional to $E^2[f\chi_a]$. From the above
expression we see that the effect of the noise is to reduce the expected values we see when
using the noisy oracle by a factor, $1 - 2\eta$, that is inverse polynomial in our allowed running time.
Thus by increasing the number of samples by an appropriate amount we will still be able to
find the desired $\chi_a$'s. □
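The attenuation identity used in this proof can be verified exactly on a small example. The sketch below makes several assumptions for illustration: $f$ takes values in $\{-1, +1\}$, $\chi_a(x) = (-1)^{a \cdot x}$ with $a$ and $x$ encoded as integers, and the sample target and function names are hypothetical. The noisy expectation is computed by averaging over the two possible label outcomes rather than by random sampling.

```python
def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a . x), with a and x as bit masks."""
    return -1 if bin(a & x).count("1") % 2 else 1

def coefficient(f, a, n):
    """E[f chi_a] under the uniform distribution on n-bit inputs."""
    return sum(f(x) * chi(a, x) for x in range(2 ** n)) / 2 ** n

def noisy_coefficient(f, a, n, eta):
    """E_eta[f chi_a]: each label is flipped independently with probability eta."""
    total = 0.0
    for x in range(2 ** n):
        expected_label = (1 - eta) * f(x) + eta * (-f(x))
        total += expected_label * chi(a, x)
    return total / 2 ** n

def sample_target(x):
    """A hypothetical +/-1-valued target: -1 when the two low bits are both set."""
    return -1 if (x & 0b11) == 0b11 else 1

n, eta = 3, 0.2
for a in range(2 ** n):
    clean = coefficient(sample_target, a, n)
    noisy = noisy_coefficient(sample_target, a, n, eta)
    print(format(a, "03b"), clean, noisy, (1 - 2 * eta) * clean)
```

Every coefficient shrinks by the same factor $1 - 2\eta$, so the relative ordering of the $\chi_a$'s is unchanged and only the number of samples needed to detect the large ones grows.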
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